<html><head>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-2">
<title>68000 TRICKS AND TRAPS</title>
</head><body>

<h1>68000 TRICKS AND TRAPS</h1>
<h2>Some assembly language programming guidelines</h2>
<p>BYTE magazine, September 1986</p>
<p>by Mike Morton</p>
<p>THE ERA OF HIGH-LEVEL LANGUAGES has not made assembly language
coding a dead art, even on modern microprocessors designed for
executing compiled high-level code. Although personal computers are
approaching the power of mainframes, the way to get the most out of any
processor is to know when to use assembly language. The popular
Motorola MC68000 processor is a good example; it has a fairly regular
instruction set and instructions to support features of such languages
as Pascal and C. Yet the instruction set is not perfectly orthogonal --
warts in the design and implementation make the architecture
interesting for hand-coded assembly programs.</p>
<p>The continuing usefulness of assembly language, even on this
processor, is apparent in recent industry products. The Macintosh ROM,
for instance, is written entirely in assembly language, yielding
considerable savings in memory and speed.</p>
<p>In this article I'll survey some of the subtleties of the 68000 to
help you avoid its pitfalls and exploit its oddities for better speed
or memory use. I assume that you have some experience using the 68000;
if not, see the bibliography at the end of this article.</p>
<h2>TRAPS FOR THE BEGINNER</h2>
<p>Most of the 68000's "traps" have a reason behind them; unintuitive
aspects of the processor may actually be more useful, easier to
implement, or correct in the view of the 68000 designers.</p>
<p>One trap is memory alignment. Although the 68000 supports byte,
word, and long-word operations, word and long-word operands must be
aligned on word boundaries (even addresses). This is because memory is
grouped in words (2 bytes) and accessed via a 16-bit bus. Instructions
must be word-aligned also, but assemblers and linkers normally do this
for you.</p>
<p>Another trap is stack direction. The 68000 stack "grows" toward low
memory. This means that to allocate stack space you should subtract
from the stack pointer: <tt>SUB&nbsp;#size,SP</tt>. To deallocate
space (or to discard previously pushed values), add to the stack
pointer. Equally confusing, when allocating local storage with the <tt>LINK</tt> instruction, you must specify a negative displacement to be added to the stack pointer.</p>
<p>The stack pointer (<tt>SP</tt>) must stay word-aligned. If you push or pop a byte through the <tt>SP</tt>, the processor will move a word to or from the stack, placing the relevant byte in the high-order half of that word. Only the <tt>SP</tt>
behaves this way; other address registers act the way you'd expect.
This may seem an anomaly in an orthogonal architecture, but the <tt>SP</tt> must stay word-aligned so that words and long words are pushed to even addresses.</p>
<p>Shift and rotate instructions can operate on the byte, word, or
long-word part of a data register, but shifts of memory operands can be
only word size. Data registers can be shifted by up to 32 bits if the
shift count is specified in another register or by up to 8 bits if the
shift count is a constant given in the instruction. Memory can be
shifted by only 1 bit.</p>
<p>The syntax of two-operand instructions may be reversed from other machines you're used to. For instance, the 68000 instruction <tt>MOVE.W&nbsp;D0,D1</tt> is equivalent to <tt>LOAD&nbsp;D1,D0</tt> on some other machines; that is, the contents of the <tt>D0</tt> register are moved into <tt>D1</tt>. On the 68000, the destination register -- the one affected by the instruction -- is always second. The operand order for <tt>CMP</tt> instructions is also reversed from some older machines; therefore you would read <tt>CMP&nbsp;D0,D1</tt> as "compare <tt>D1</tt> to <tt>D0</tt>." (But beware: Some assemblers reverse the order of the operands from Motorola's standard; UNIX assemblers often do this.)</p>
<p>The 68000 provides the comparison operations shown in table 1. This
includes not only all six possible relationships between two numbers,
but also whether numbers are compared as signed or unsigned quantities.
(Comparing the word values <tt>0006</tt> and <tt>FFFE</tt> hexadecimal
depends on how the numbers are interpreted. If they're signed numbers,
6 is greater than -2. But if they're addresses, they're unsigned, and <tt>0006</tt> is a lower address than <tt>FFFE</tt>.)</p>
<hr>
<p><b>Table 1:</b> <i>This table shows which branch instructions will result in a branch taken when testing for a given relationship of</i> <tt>D1</tt> <i>to</i> <tt>D0</tt> <i>after a</i> <tt>CMP&nbsp;D0,D1</tt> <i>instruction.</i></p>
<pre> Relationship    Signed     Unsigned
 -------------------------------------------------------
 D1 &lt;  D0        BLT        BCS (branch on Carry Set)
 D1 &lt;= D0        BLE        BLS
 D1 =  D0        BEQ        BEQ
 D1 &lt;&gt; D0        BNE        BNE
 D1 &gt;  D0        BGT        BHI
 D1 &gt;= D0        BGE        BCC (branch on Carry Clear)</pre>
<hr>
<p>The confusing thing is that the expected unsigned equivalents of <tt>BLT</tt> and <tt>BGE</tt> are not <tt>BHS</tt> (branch on higher or same) and <tt>BLO</tt> (branch on lower). Instead, Motorola uses <tt>BCS</tt> and <tt>BCC</tt>,
respectively. The processor is perfectly orthogonal, providing for all
types of comparisons. But the mnemonics are asymmetrical on the
unsigned side (unless you use a nonstandard assembler or define your
own macros).</p>
<p>(The distinction between signed and unsigned comparisons comes up
rarely, since they are the same unless one of the values involved has
the high-order bit [the sign bit] set. However, when the distinction is
significant, it can lead to trouble. An operating system's disk
allocator may sort disk blocks using a <tt>BGT</tt> instruction. After
some years, a site tries to configure a system with more than 2^31
bytes of disk storage. Everything grinds to a halt because <tt>BGT</tt> compares <tt>80000010</tt> hexadecimal to <tt>7FFFFFF0</tt> hexadecimal and incorrectly finds the latter address to be greater. A <tt>BHI</tt> instruction would have compared and sorted the addresses correctly.)</p>
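<p>(You can verify this arithmetic directly. The following is a quick model in Python rather than 68000 code; the helper <tt>signed32</tt> is mine.)</p>

```python
def signed32(x):
    """Interpret a 32-bit pattern as a two's-complement signed value."""
    return x - 0x100000000 if x & 0x80000000 else x

a, b = 0x80000010, 0x7FFFFFF0     # the two disk addresses from the text

# BGT's signed view wrongly ranks 7FFFFFF0 above 80000010...
bgt_wrong = signed32(b) > signed32(a)
# ...while BHI's unsigned view ranks the addresses correctly.
bhi_right = a > b
```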
<p>A note on using the condition codes: After a <tt>TST</tt> instruction, the overflow (<tt>V</tt>) condition code is cleared. This means that after <tt>TST</tt>, <tt>BLT</tt> is equivalent to <tt>BMI</tt>, and <tt>BGE</tt> is the same as <tt>BPL</tt>. Stylistically, <tt>BMI</tt> and <tt>BPL</tt> make more sense after a <tt>TST</tt> unless the value being tested is the difference of two other values.</p>
<h2>TRAPS FOR EXPERTS</h2>
<p>Some quirks of the 68000 are less intuitive and regularly catch
seasoned programmers. Some of these aspects of the implementation
suggest design difficulties and trade-offs in the processor; others
reflect the designers' ideas on what constitutes good programming.</p>
<p>Addresses and data are different. Most assemblers quietly assemble <tt>MOVE&nbsp;#0,An</tt> as a <tt>MOVEA</tt> (move to an address register) instruction without nagging the programmer about the distinction between <tt>MOVE</tt> and <tt>MOVEA</tt>. But the 68000 treats data and address values very differently.</p>
<p>Address operations (<tt>MOVEA</tt>, <tt>ADDA</tt>, etc.) are never byte-size.</p>
<p>Word values are sign-extended to 32 bits before being used in address operations. Thus, <tt>ADDA.W&nbsp;D1,A2</tt> extends the low-order word of <tt>D1</tt> before adding it to <tt>A2</tt>.
In the 68000, there is no such thing as a 16-bit address, so a
word-size value is converted to 32 bits before being used in address
operations.</p>
<p>Address operations never set condition codes; most data operations
do. This is useful in subroutines that return information in the
condition codes:</p>
<pre> TST.W D0         ;Set condition codes to return to caller.
 MOVE.L (SP)+,A0  ;Pop return address into A0.
 ADD.W #params,SP ;Deallocate &lt;params&gt; bytes of parameters.
 JMP (A0)         ;Return with condition codes still set</pre>
<p>(Note that the <tt>MOVE</tt> and <tt>ADD</tt> are translated into <tt>MOVEA</tt> and <tt>ADDA</tt> by the assembler.) The condition codes set by the <tt>TST.W</tt> are unaffected by the remainder of the exit sequence.</p>
<p>Another trap concerns loop operations. A loop ending with a <tt>DBcc</tt> instruction (such as <tt>DBEQ</tt>) loops until the condition <tt>cc</tt>
is true; this instruction can be thought of as "decrement and branch
back if condition false." This is confusing since, if you were to write
out several instructions to replace a <tt>DBEQ</tt>, they would contain a <tt>BNE</tt> to jump back to the top of the loop, not a <tt>BEQ</tt>.</p>
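<p>The decision procedure of <tt>DBcc</tt> can be sketched as follows (a Python model, not 68000 code; the function <tt>dbcc</tt> and its return convention are my own):</p>

```python
def dbcc(cond_true, dn):
    """Model one execution of DBcc.

    Returns (branch_back, new_dn): the loop exits (no branch) when the
    condition is true, or when the low word of dn decrements to -1.
    """
    if cond_true:
        return False, dn                 # condition met: fall through
    low = (dn - 1) & 0xFFFF              # decrement the low word only
    new_dn = (dn & 0xFFFF0000) | low
    return low != 0xFFFF, new_dn         # branch back unless count hit -1
```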
<p>If the condition being tested for is not detected (or if you're using <tt>DBRA</tt>),
the loop will stop when the counter reaches -1, not 0. If you want the
loop always to be executed once, you should enter it at the top with
the count already decremented by 1. For example, to search for the
first null (zero) byte in a table of N bytes pointed to by <tt>A1</tt>:</p>
<pre>     MOVE.W #N-1,D0      ;Start the loop counter one too low.
 LOOP                    ;Come here to test another byte.
     TST.B (A1)+         ;Is A1's byte zero? (Advance after testing)
     DBEQ D0,LOOP        ;If not zero AND D0 is still &gt;= 0, loop back.</pre>
<p>This loop will execute at most N times. It corresponds to Pascal's
"repeat... until" construct. For the equivalent of "while...do," which
doesn't necessarily enter the loop:</p>
<pre>     MOVE.W #N,D0  ;Start the loop counter normally.
     BRA LOOPSTART ;Don't fudge D0; jump to the loop end first.
 LOOP              ;This is the loop head.
     TST.B (A1)+   ;Is A1's byte zero? (Advance after testing)
 LOOPSTART         ;Enter here to check count before looping.
     DBEQ D0,LOOP  ;If not zero AND D0 is still &gt;= 0, loop back.</pre>
<p>If you're using <tt>DBcc</tt>, don't forget to initialize the condition codes so the <tt>DBcc</tt> doesn't fall through when you jump to it. In the code above, the <tt>MOVE.W&nbsp;#N,D0</tt> "primes" the condition codes for the loop.</p>
<p>Also remember that the data register used to control the loop is
decremented as a word quantity. If it's possible to have more than 2^16
iterations, you have to nest two <tt>DBcc</tt> loops. For example, to checksum a list of bytes whose length is specified in the long word <tt>D0</tt>:</p>
<pre>         MOVEQ #0,D3     ;Initialize checksum.
         MOVE.W D0,D1    ;Low word of loop length in D1.
         MOVE.L D0,D2    ;Get high word of loop length
         SWAP D2         ;in D2 to use for outer loop.
         BRA.S START     ;Enter at the end of the loop.
 LOOP:   ADD.B (A1)+,D3  ;Add the next byte into sum.
 START:  DBRA D1,LOOP    ;Inner loop: Loop on low word of D0.
         DBRA D2,LOOP    ;Outer loop: Loop on high word.</pre>
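<p>(To convince yourself that this nested loop really executes the <tt>ADD</tt> exactly <tt>D0</tt> times, here is its control flow modeled in Python; the function name is mine.)</p>

```python
def checksum_iterations(length):
    """Count inner-loop executions of the nested-DBRA idiom, entering at
    START with D1 = low word and D2 = high word of the 32-bit length."""
    d1 = length & 0xFFFF
    d2 = (length >> 16) & 0xFFFF
    adds = 0
    while True:
        d1 = (d1 - 1) & 0xFFFF           # inner DBRA D1,LOOP
        if d1 != 0xFFFF:
            adds += 1                    # ADD.B (A1)+,D3 runs once
            continue
        d2 = (d2 - 1) & 0xFFFF           # outer DBRA D2,LOOP
        if d2 != 0xFFFF:
            adds += 1
            continue
        return adds
```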
<p>Small adjustments to the stack pointer can be done with <tt>ADDQ</tt> (or <tt>SUBQ</tt>) <tt>#n,SP</tt>, but these instructions can change it by at most 8 bytes. The fastest way to change it by more than 8 bytes is with <tt>LEA&nbsp;n(SP),SP</tt>.</p>
<p>The 68000 does not allow you to execute a <tt>MOVE</tt> instruction with a destination relative to the program counter (<tt>PC</tt>).
In the view of the 68000 designers, code should not patch itself. If
you must change a table in the middle of code, you must point to it
with an instruction like <tt>LEA&nbsp;TABLE(PC),An</tt> and then alter it through <tt>An</tt>.
(Self-modifying code is especially bad for 68000 programs that may
someday run on the 68020, because the 68020's instruction cache
normally assumes that code is pure.)</p>
<p>For no apparent reason, the <tt>CLR</tt> instruction always reads from an operand before clearing it. But unlike <tt>BCLR</tt>, <tt>CLR</tt> doesn't set the condition codes from the operand's old value. Never use <tt>CLR</tt> to write a zero to a memory-mapped device address if the device will be affected by the read. The <tt>Scc</tt> instruction and <tt>MOVE</tt>s from the status register also read before writing but are less likely to cause problems.</p>
<p>Don't confuse the <tt>EXG</tt> and <tt>SWAP</tt> commands. <tt>EXG</tt> exchanges the 32-bit contents of two registers. <tt>SWAP</tt> swaps the 16-bit halves of a single data register.</p>
<p>When indexing into an array, remember to multiply the index register
by the "stride" (bytes per element) of the array. For instance, if <tt>D0</tt> holds an index into an array of long words pointed to by <tt>A1</tt>, you must multiply the index by 4 to convert from long words to bytes:</p>
<pre> MULU #4,D0              ;Turn the array index into a byte offset.
 MOVE.L 0(A1,D0.L),D1    ;Pick up the long-word array element in D1.</pre>
<p>The <tt>EOR</tt> instruction must have a data register for the source, except for the immediate form of the instruction, <tt>EORI</tt>.</p>
<h2>CODING FOR SPEED: PRINCIPLES</h2>
<p>The secret of efficient code on the 68000 can be described using one
word: "registers." Suppose, for example, that you have two 32-bit
variables. If you keep them in registers, the time to add one to the
other with <tt>ADD.L&nbsp;D0,D1</tt> is 8 clock periods. If they're in memory pointed to by address registers, the time to add them with <tt>MOVE.L&nbsp;(A0),D0</tt> and <tt>ADD.L&nbsp;D0,(A1)</tt>
is 32 clock periods, four times slower! The moral of the story is
simple: Work hard to keep frequently used quantities in registers.</p>
<p>You can learn this important rule and others by studying instruction timing information (such as the tables in the <i>M68000 16/32-bit Microprocessor Programmer's Reference Manual</i>).
Times are given in clock periods, which I'll call cycles; a 10-MHz
processor executes 10 cycles per microsecond. In general, the tables
give the base time for each instruction. Most base times must have
additional times added in for the operands. For instance, the time to
execute <tt>AND.W&nbsp;D0,(A1)+</tt> is 8 cycles for a word-size <tt>AND</tt>-to-memory and 4 more cycles for the <tt>(A1)+</tt>
destination operand. (The source operand is "free" because it's a
register.) Thus, the entire instruction takes 12 cycles, or 1.2 &#181;sec on
a 10-MHz processor.</p>
<p>When you're trying to save a few cycles in a crucial loop, timing
tables can be useful as more than just a reference. They provide a
concise summary of the architecture -- sort of a shopping list of the
instructions available and the cost of each. When you're trying to
avoid preconceived notions of which instructions are suited to solving
a problem, this summary can remind you of alternatives and encourage
lateral drift in your thinking.</p>
<h2>CODING FOR SPEED: BASIC RULES</h2>
<p>The <tt>MOVEQ</tt>, <tt>ADDQ</tt>, and <tt>SUBQ</tt> instructions are great. For instance, it's faster to zero all 32 bits of a data register by using <tt>MOVEQ&nbsp;#0,Dn</tt> than it is to use <tt>CLR.L Dn</tt>. Remember that these instructions are limited to small numbers: <tt>MOVEQ</tt> can load values from -128 to 127 into a data register; <tt>ADDQ</tt> (<tt>SUBQ</tt>) can add (subtract) only values from 1 to 8 to (from) its destination.</p>
<p><tt>DBcc</tt> is especially efficient; use it whenever you can. (But beware the traps described above.)</p>
<p>Not all assemblers automatically produce "short" branches (branches
with 8-bit displacements). Check the output of your assembler to see if
it emits a short branch whenever possible. If not, you may have to use <tt>BRA.S</tt>, <tt>BSR.S</tt>, and <tt>Bcc.S</tt> in your source code instead of <tt>BRA</tt>, <tt>BSR</tt>, and <tt>Bcc</tt>.</p>
<p>Because a taken short branch is slower than an untaken one, try to
avoid taking most branches. For instance, if you have a loop searching
for a null, the simple way to search is with</p>
<pre> LOOP            ;Here to search for the next null.
   TST.B (A3)+   ;Check next byte; advance the search pointer.
   BNE.S LOOP    ;Loop back if not found.</pre>
<p>It takes only a bit more space to "unroll" one or more iterations of the loop:</p>
<pre> LOOP            ;Here to search for the next null.
   TST.B (A3)+   ;Check next byte; advance the search pointer.
   BEQ.S FOUND   ;If zero, exit the loop.
   TST.B (A3)+   ;Not zero: check another byte and advance.
   BNE.S LOOP    ;If still not found, loop back.
 FOUND           ;Come here when A3 points one past the null byte.</pre>
<p>If the character tested generally isn't zero, the <tt>BEQ.S</tt> usually goes untaken and is faster. You can unroll any number of iterations, adding <tt>TST.B</tt>/<tt>BEQ.S</tt>
pairs until the extra space consumed is no longer worth the diminishing
increase in speed (or the branches become long branches).</p>
<p>Addressing with <tt>(An)+</tt> is faster than with <tt>-(An)</tt>.
If you have the choice of which direction to go in a search or other
loop through memory, move upward. (Note that this is not true for the
destination operand of a <tt>MOVE</tt> instruction.)</p>
<p>Because <tt>(An)</tt> addressing is faster than <tt>x(An)</tt>,
access to the first element of a data structure is faster than to the
others. (This is also useful with Pascal records, C structures, etc.)</p>
<p>The <tt>MOVEM</tt> instruction is a very efficient way to stack or
unstack a large number of registers. But if you have to push only two
registers, or pop three, <tt>MOVEM</tt> is no faster than moving them one at a time.</p>
<p>Don't assume that long operations are always slower than word-size
ones. For instance, word address operations can be slower than long
ones because of the time to sign-extend a word value.</p>
<p>As with other machines, never multiply or divide by a power of 2
when you can shift instead. Although shifts are time-consuming, they're
always faster than a multiplication or division. So, you can use the <tt>ASL</tt> (arithmetic shift left) instruction to multiply by a power of 2 and use <tt>ASR</tt>
(arithmetic shift right) to divide by a power of 2. (Be careful here --
the right shift is not the same as a division if the contents of the
register are negative. For example, -1 divided by 2 should be zero, but
-1 shifted right by 1 bit is -1, rounded incorrectly.) Don't forget
that the multiplication instructions produce a long-word result from a
word operand; shifting doesn't.</p>
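<p>(Python's <tt>&gt;&gt;</tt> operator on negative integers is an arithmetic shift, so it models <tt>ASR</tt> directly and makes the rounding difference easy to see:)</p>

```python
# Arithmetic shift right rounds toward minus infinity...
shifted = -1 >> 1        # stays -1, not 0
# ...while division truncating toward zero gives a different answer.
divided = int(-1 / 2)    # 0
```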
<p>To multiply by 2, add a register to itself instead of shifting: <tt>ADD&nbsp;Dn,Dn</tt>. In fact, if you are multiplying a word operand by 4, you can do it faster with two <tt>ADD</tt> instructions than with a single shift by 2 bits.</p>
<p>Similarly, in doing extended-precision arithmetic, you can replace the common operation <tt>ROXL&nbsp;#1,Dn</tt> with <tt>ADDX&nbsp;Dn,Dn</tt> and save 2 or 4 cycles, depending on whether the operands are words or longs.</p>
<p>You can compute certain multiplications faster with shifts even if they're not powers of 2. For instance, to multiply <tt>D0</tt> by 17, add <tt>D0</tt> to 16 times <tt>D0</tt>:</p>
<pre> MOVE D0,D1      ;Copy D0 to D1.
 LSL #4,D0       ;Compute 16xD0 in D0.
 ADD D1,D0       ;Add original value in to compute 17xD0.</pre>
<p>Computing products this way is still faster than the 40-plus cycles for a multiply instruction.</p>
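<p>(The three-instruction sequence is easy to check; here it is modeled in Python on a 16-bit word, with the masking made explicit. The function name is mine.)</p>

```python
def times17(d0, bits=16):
    """Model the MOVE/LSL/ADD sequence for a word-size register."""
    mask = (1 << bits) - 1
    d1 = d0 & mask                # MOVE D0,D1: save a copy
    d0 = (d0 << 4) & mask         # LSL #4,D0: compute 16 x D0
    return (d0 + d1) & mask       # ADD D1,D0: 17 x D0
```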
<p>The cost of maintaining the stack can be lessened if arguments are
deleted after the call by the caller, not the subroutine. (Most C
compilers use this stack protocol.) If the stack doesn't have to be
cleaned up after every call, you can allow debris from several calls in
a row to accumulate as long as it's easy to keep track of how much
there is. Typically, you can let it pile up until you reach a branch,
then unstack it all with an <tt>ADDQ</tt> (or <tt>LEA</tt> if there's more than 8 bytes to remove).</p>
<p>Finally, don't ignore the 68000's "higher-level" instructions. Even at the assembler level, instructions such as <tt>PEA</tt>, <tt>LINK</tt>, <tt>UNLK</tt>, and <tt>CHK</tt> can be very useful.</p>
<h2>CODING FOR SPEED: SOME COOKBOOK EXAMPLES</h2>
<p>Here are a variety of things you can do to save time when you're
scraping for cycles. Some are useful in many applications; others are
very specialized. The more obscure ones are examples of the kinds of
tricks that the 68000 can do.</p>
<p>Remember that timings will not be the same on the 68000's relatives
(the 68008, 68010, 68020, etc.). If you're working on one of these
processors, recompute the timings or, when you're not sure which of two
approaches is faster, measure the speed of both. Timings for the 68020
will be especially hard to compute because of its sophisticated
prefetch and instruction caching.</p>
<p>You should also know that not all computers run the processor at the
advertised speed. For instance, the Macintosh's 68000 runs at 7.8 MHz,
but it can't always operate at this speed because the screen is
memory-mapped and "steals" some memory cycles. Thus, the effective
speed of the Macintosh is about 6 MHz, but only memory cycles are
slowed down -- CPU cycles are unaffected. So operations done mostly
within the CPU (such as multiply, divide, and long register shifts) run
at nearly full speed. The lesson in all this is that timings are hard
to compute or intuit; you may want to time various pieces of code for
yourself to see which is faster.</p>
<p>It is said that one doesn't really know how to use a tool until one
knows three ways to abuse it. Here are some of my favorite ways to
abuse the 68000.</p>
<p><b>Fast subroutine calls.</b> Although <tt>JSR</tt> and <tt>RTS</tt>
provide a simple subroutine call-and-return, the cost of pushing the
return address on the stack is significant. For a very frequently
called subroutine, you can change the call to store the return in an
address register as in</p>
<pre>  LEA RETURN,A0  ;Return address goes in A0.
  JMP routine    ;Jump to the subroutine
 RETURN          ;A0 points to this spot.</pre>
<p>Then to return, just <tt>JMP (A0)</tt>. By avoiding use of memory, this saves 8 cycles. (Note that the <tt>LEA</tt> instruction references the label <tt>RETURN</tt> with PC-relative addressing.)</p>
<p>Also, if you end a subroutine with</p>
<pre> JSR lastsub     ;Call one last subroutine
 RTS             ;and return.</pre>
<p>and lastsub doesn't alter the stack, you can save a whopping 24
cycles by using "tail recursion" to replace the two instructions with a
single</p>
<pre> JMP lastsub     ;"Call" lastsub and it'll return for us.</pre>
<p>Finally, if you call a subroutine and then branch somewhere else, you can avoid extra jumping around. For instance,</p>
<pre> JSR sub         ;Make a call
 JMP next        ;and go somewhere else.</pre>
<p>can be made slightly faster with</p>
<pre> PEA next        ;Push a fake return address
 JMP sub         ;and "call."
                 ;sub will RTS to next for us.</pre>
<p>(All of the above work for <tt>BSR</tt> and <tt>BRA</tt> as well as <tt>JSR</tt> and <tt>JMP</tt>.)</p>
<p><b>Quick test for zero.</b> If you want to test whether a register is zero and don't mind trashing the value, use <tt>DBRA&nbsp;Dn,NOTZERO</tt> instead of combining <tt>TST.W&nbsp;Dn</tt> with <tt>BNE&nbsp;NOTZERO</tt>.</p>
<p>If you want to do an N-way branch depending on a value, you'll
usually want to index into a jump table and transfer to the appropriate
address. A "case" statement is typically implemented this way. But if
you have a very small number of values and want to handle the lower
values more quickly, a series of <tt>DBRA</tt>s can do this conveniently. For example, if you want to branch based on a register that contains 0, 1, or 2,</p>
<pre>   DBRA D0,NOT0  ;Decrement; jump if it wasn't zero.
     &lt;handle zero case&gt;
 NOT0            ;Come here if not zero. D0 has been decremented.
   DBRA D0,NOT1  ;Decrement; jump if original D0 wasn't one.
     &lt;handle one case&gt;
 NOT1            ;Not one. D0 has been decremented twice.
   DBRA D0,ERROR ;Decrement; if not originally two, error.
     &lt;handle two case&gt;</pre>
<p><b>Checking for membership in a small set.</b> If you want to see if
a number is in a set of several numbers, you can create a bit mask
corresponding to the set. For instance, if the set is {0,1,3,5}, the
mask has those bits set and the bit map is <tt>00101011</tt> (<tt>2B</tt> hexadecimal). You can test for membership in this set with</p>
<pre> BTST D0,#$2B    ;Is D0 in {0,1,3,5}?</pre>
<p>If your set includes numbers larger than 7 (the immediate form of <tt>BTST</tt> tests only a byte), you have to move the mask into a data register first.</p>
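<p>(The membership test is simple to model in Python; <tt>in_set</tt> is my name, and like the byte-size <tt>BTST</tt> form it is meant only for bit numbers 0 through 7.)</p>

```python
SET_MASK = 0x2B           # bits 0, 1, 3, and 5 set: the set {0,1,3,5}

def in_set(d0, mask=SET_MASK):
    """Model BTST D0,#mask: test bit number D0 of the mask (D0 in 0..7)."""
    return bool(mask & (1 << d0))
```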
<p><b>Quick comparisons.</b> To check the value of a data register with <tt>CMP.L&nbsp;#xxx,Dn</tt> takes 14 cycles. If the value being tested for is small enough to fit in a <tt>MOVEQ</tt>, it's shorter and faster to put the value in a temporary register:</p>
<pre> MOVEQ #xxx,D0   ;Set up value to look for
 CMP.L D0,Dn     ;and do the comparison.</pre>
<p>If the value xxx is between -8 and 8, and you don't mind altering the data register, you can just use <tt>SUBQ&nbsp;#xxx,Dn</tt> (or <tt>ADDQ</tt>, as appropriate) instead of a <tt>CMP</tt>. Then you can use a conditional branch just as you would after a <tt>CMP</tt>. This works for word or long-word comparisons.</p>
<p><b>Picking up an unaligned word.</b> The straightforward approach is
to load 1 byte, shift it into position, and load the second byte. The
faster way (28 cycles instead of 38) is to exploit the stack pointer's
odd behavior when byte quantities are pushed on the stack:</p>
<pre> MOVE.B (A0)+,-(SP)      ;First byte to high half of new word on stack.
 MOVE.W (SP)+,D0         ;Pop that new word to D0. First byte in place.
 MOVE.B (A0),D0          ;Second byte in place.</pre>
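<p>(The net effect of the sequence -- without simulating the stack itself -- is a big-endian word fetch from an odd address, modeled here in Python; the function name is mine.)</p>

```python
def fetch_unaligned_word(mem, addr):
    """The first byte lands in the high half of the result, the byte
    after it in the low half (68000 words are big-endian)."""
    return (mem[addr] << 8) | mem[addr + 1]
```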
<p><b>Clearing address registers.</b> <tt>MOVE.L&nbsp;#0,An</tt> takes 12 cycles, while <tt>SUB.L&nbsp;An,An</tt> takes only 8 and is shorter. (<tt>CLR</tt> doesn't work with address registers.)</p>
<p><b>Avoiding long shifts and rotates.</b> The time the 68000 takes to
shift a register is proportional to the distance being shifted: 2
additional cycles for every bit. Thus, never rotate a long word more
than 16 bits in either direction or a word more than 8 bits. (Remember
that to shift by more than 8 bits, you have to put the shift count into
a data register. In the examples that follow, the bit count is not a
constant; the value is bracketed to show this.)</p>
<pre> ROL.L &lt;16+x&gt;,Dn = ROR.L &lt;16-x&gt;,Dn
 ROL.W &lt; 8+x&gt;,Dn = ROR.W &lt; 8-x&gt;,Dn</pre>
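<p>(These identities follow from rotation arithmetic; a Python model of 32-bit rotates -- the function names are mine -- lets you check them for any value:)</p>

```python
def rol32(v, n):
    """Rotate a 32-bit value left by n bits (n in 1..31)."""
    return ((v << n) | (v >> (32 - n))) & 0xFFFFFFFF

def ror32(v, n):
    """Rotate a 32-bit value right by n bits (n in 1..31)."""
    return ((v >> n) | (v << (32 - n))) & 0xFFFFFFFF
```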
<p>In shifting 16 bits or more, the first 16 bits of the shift can be done with a <tt>SWAP</tt> to save 26 cycles in each of these cases:</p>
<pre> LSR.L &lt;16+x&gt;,Dn =
   CLR.W Dn      ;Clear bits that swap up
   SWAP Dn       ;and LSR.L #16,Dn,
   LSR.W &lt;x&gt;,Dn  ;Now finish the shift.

 ASR.L &lt;16+x&gt;,Dn =
   SWAP Dn       ;Slide down 16 bits.
   EXT.L Dn      ;Sign-extend to a long word.
   ASR.W &lt;x&gt;,Dn  ;Finish up, sign-extended.

 LSL.L &lt;16+x&gt;,Dn =
   LSL.W &lt;x&gt;,Dn  ;Shift x bits in low half.
   SWAP Dn       ;Shift 16 more bits.
   CLR.W Dn      ;Throw away bottom half.</pre>
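<p>(To see that the <tt>CLR.W</tt>/<tt>SWAP</tt>/<tt>LSR.W</tt> sequence really equals <tt>LSR.L&nbsp;#(16+x)</tt>, here is the same sequence traced step by step in Python; the function name and explicit 32-bit masking are mine.)</p>

```python
def lsr_long_16_plus(v, x):
    """Model the CLR.W / SWAP / LSR.W #x replacement for LSR.L #(16+x)."""
    v &= 0xFFFF0000                                # CLR.W Dn: clear low word
    v = ((v >> 16) | (v << 16)) & 0xFFFFFFFF       # SWAP Dn: exchange halves
    return (v & 0xFFFF0000) | ((v & 0xFFFF) >> x)  # LSR.W #x: low word only
```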
<p>And some long-word operations of less than 16 bits can be optimized with <tt>SWAP</tt>. Long shifts between 11 and 15 bits can be speeded up with</p>
<pre> LSL.L &lt;x&gt;,Dn =
   SWAP Dn         ;Rotate left by 16 bits.
   ROR.L &lt;16-x&gt;,Dn ;Undo to x-bit left rotate.
   AND.W #mask,Dn  ;Remove bottom x bits.

 LSR.L &lt;x&gt;,Dn =
   AND.W #mask,Dn  ;Remove bottom x bits.
   SWAP Dn         ;Rotate right by 16 bits.
   ROL.L &lt;16-x&gt;,Dn ;Undo to x-bit right rotate.</pre>
<p><b>Fast sign-extend.</b> While there are instructions to sign-extend
bytes into words or words into long words, what if you have a signed
12-bit field (from unpacking a record or reading a DAC)? The standard
way to sign-extend this to a full 16-bit word is with</p>
<pre> LSL.W #16-12,Dn ;Shift so 12-bit field is left-justified.
 ASR.W #16-12,Dn ;Shift it back down sign-extended.</pre>
<p>If you know that the bits outside the 12-bit field are zero, you can
do this without shifting. In general, if you want to sign-extend an
N-bit field to 16 bits, define "mask" to be -(2^(N-1)) -- a mask with
the top 17-N bits set and the bottom N-1 bits clear. Then the sign extension can be done using a
temporary register:</p>
<pre> MOVE.W #mask,D0 ;Build mask with top 17-N bits set.
 ADD.W D0,Dn     ;Negative: top bits=0.
                 ;Positive: top bits=1.
 EOR.W D0,Dn     ;Flip so top bits are correct.</pre>
<p>This is always at least as fast as the shifting method, which gets
slower as N increases. Sign-extending to a long word is faster this way
if N is 3 or more.</p>
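<p>(Both methods are modeled below in Python for a 16-bit word so you can check that they agree; the function names are mine.)</p>

```python
def extend_by_shifts(v, n=12):
    """Reference method: LSL.W #(16-n) then ASR.W #(16-n)."""
    v = (v << (16 - n)) & 0xFFFF
    if v & 0x8000:
        v -= 0x10000                     # treat as signed for the ASR
    return (v >> (16 - n)) & 0xFFFF

def extend_by_add_eor(v, n=12):
    """The ADD/EOR trick: mask is -(2**(n-1)) as a 16-bit pattern."""
    mask = (-(1 << (n - 1))) & 0xFFFF    # 0xF800 for n == 12
    return ((v + mask) & 0xFFFF) ^ mask  # ADD.W then EOR.W
```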
<p><b>Loading large constants.</b> To move certain values into the upper half of a data register, you might code <tt>MOVE.L&nbsp;#00xx0000,Dn</tt>. It's faster to replace this single instruction with two:</p>
<pre> MOVEQ #xx,Dn    ;Move value to lower half and clear upper half.
 SWAP Dn         ;Swap -- put things in position.</pre>
<p><b>Clearing the upper half of a data register.</b> Instead of doing this with <tt>AND.L&nbsp;#$FFFF,Dn</tt>, it's quicker to use</p>
<pre> SWAP Dn    ;Swap high and low halves.
 CLR.W Dn   ;Clear high half while it's low.
 SWAP Dn    ;Put things back in place.</pre>
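<p>(Here is the <tt>SWAP</tt>/<tt>CLR.W</tt>/<tt>SWAP</tt> sequence traced in Python; the function name is mine.)</p>

```python
def clear_upper(v):
    """Model SWAP / CLR.W / SWAP on a 32-bit register."""
    v = ((v >> 16) | (v << 16)) & 0xFFFFFFFF     # SWAP: exchange halves
    v &= 0xFFFF0000                              # CLR.W: clear the (old high) half
    return ((v >> 16) | (v << 16)) & 0xFFFFFFFF  # SWAP back
```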
<h2>CONCLUSIONS</h2>
<p>Esoteric coding techniques continue to be important in pushing
processors to their limits. A machine such as the 68000, which is oriented
toward executing compiled high-level languages, can still be
appropriate for tight hand-crafted solutions. A programmer who needs
the utmost in performance can exploit quirks in an instruction set to
great advantage.</p>
<h2>BIBLIOGRAPHY</h2>
<ul>
<li>Harmon, Thomas L., and Barbara Lawson. <i>The Motorola 68000 Microprocessor Family.</i> Englewood Cliffs, NJ: Prentice-Hall, 1985.
</li><li>Kane, Gerry, Doug Hawkins, and Lance Leventhal. <i>68000 Assembly Language Programming.</i> Berkeley, CA: Osborne/McGraw-Hill, 1981.
</li><li>Motorola Inc. <i>M68000 16/32-bit Microprocessor Programmer's Reference Manual.</i> Englewood Cliffs, NJ: Prentice-Hall, 1984.
</li><li>Scanlon, Leo J. <i>The 68000: Principles and Programming.</i> Indianapolis, IN: Howard W. Sams, 1981.
</li><li>Starnes, Thomas W. "Design Philosophy Behind Motorola's MC68000." BYTE, April-June 1983.
</li><li>Williams, Steve. <i>Programming the 68000.</i> Berkeley, CA: Sybex, 1985.
</li></ul>
<hr>
<p><i>Mike Morton received his B.A. in mathematics from Dartmouth
College. He is currently a senior software engineer for Lotus
Development Corporation (161 First St., Cambridge, MA 02142).</i></p>
</body></html>